Skip to main content

Pandas Intro

Pandas stands for python annotated n-dimensional arrays, which describes its functionality pretty well. Pandas is a library for reading, changing, creating and analysing data. Thanks to Pandas being built on numpy it also has very high performance.

import numpy as np
import pandas as pd

Pandas consists of mainly 2 data structures, Series and Dataframes.

Series

A Series is a 1D labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.) however it can only contain one data type at a time. A Series also has an index, a list of axis labels or identifiers, which you can see on the left.

In simple words you can imagine a Series as a column.

From List

series_list = pd.Series([1, 2, 3, 4, 5, 6, 7])
series_list
0    1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64
type(series_list)
pandas.core.series.Series

Like a numpy array, a pandas Series has a dtype. This is often a numpy dtype but pandas has also added a few of its own.

series_list.dtype
dtype('int64')

Also just like a numpy array, you can look at the shape i.e the dimensions of a Series/DataFrame.

series_string = pd.Series(["Bob", "John", "Joe"])
series_string.shape
(3,)
series_np = pd.Series(np.array([10, 20, 30, 40, 50, 60]))
series_np
0    10
1 20
2 30
3 40
4 50
5 60
dtype: int64

From Dictionary

series_dict = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0})
series_dict
a    0.0
b 1.0
c 2.0
dtype: float64

Custom indexes

Pandas generates by default an index for us which is a sequence of integers but this can be changed.

indexes = np.arange(0, 5, 1)
series_index = pd.Series(
[2 ** x for x in indexes],
index=indexes
)
series_index
0     1
1 2
2 4
3 8
4 16
dtype: int64
series_string_index = pd.Series(np.arange(1, 6, 1), index=["a", "b", "c", "d", "e"])
series_string_index
a    1
b 2
c 3
d 4
e 5
dtype: int64

Iterating over Series

Just like most other data structures in python, it is possible to iterate over them using simple for loops

for value in series_string_index:
print(value)
1
2
3
4
5
for key in series_string_index.keys():  # key being the row index
print(key)
a
b
c
d
e
for index in series_string_index.index:
print(index)
a
b
c
d
e

DataFrame

A DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series/List objects.

If we consider column names as keys and lists of items under that column as values, we can easily use a python dict to represent a DataFrame. Just like with the Series if we haven’t provided a row index it automatically generates a sequence of integers.

From dict of lists/numpy arrays

my_dict = {
"name": ["a", "b", "c", "d", "e", "f", "g"],
"age": [20, 27, 35, 55, 18, 21, 35],
"designation": ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}

df_dict = pd.DataFrame(my_dict)
df_dict
name age designation
0 a 20 VP
1 b 27 CEO
2 c 35 CFO
3 d 55 VP
4 e 18 VP
5 f 21 CEO
6 g 35 MD
df_index = pd.DataFrame(
my_dict,
index=["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"]
)
df_index
name age designation
First a 20 VP
Second b 27 CEO
Third c 35 CFO
Fourth d 55 VP
Fifth e 18 VP
Sixth f 21 CEO
Seventh g 35 MD

From CSV

df_cars = pd.read_csv("data/cars.csv", sep=";", skiprows=[1])
df_cars
Car MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
0 Chevrolet Chevelle Malibu 18.0 8 307.0 130.0 3504.0 12.0 70 US
1 Buick Skylark 320 15.0 8 350.0 165.0 3693.0 11.5 70 US
2 Plymouth Satellite 18.0 8 318.0 150.0 3436.0 11.0 70 US
3 AMC Rebel SST 16.0 8 304.0 150.0 3433.0 12.0 70 US
4 Ford Torino 17.0 8 302.0 140.0 3449.0 10.5 70 US
... ... ... ... ... ... ... ... ... ...
401 Ford Mustang GL 27.0 4 140.0 86.0 2790.0 15.6 82 US
402 Volkswagen Pickup 44.0 4 97.0 52.0 2130.0 24.6 82 Europe
403 Dodge Rampage 32.0 4 135.0 84.0 2295.0 11.6 82 US
404 Ford Ranger 28.0 4 120.0 79.0 2625.0 18.6 82 US
405 Chevy S-10 31.0 4 119.0 82.0 2720.0 19.4 82 US

406 rows × 9 columns

Meta Information

We can also look at some basic information of our DataFrame.

df_cars.shape
(406, 9)
df_cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Car 406 non-null object
1 MPG 406 non-null float64
2 Cylinders 406 non-null int64
3 Displacement 406 non-null float64
4 Horsepower 406 non-null float64
5 Weight 406 non-null float64
6 Acceleration 406 non-null float64
7 Model 406 non-null int64
8 Origin 406 non-null object
dtypes: float64(5), int64(2), object(2)
memory usage: 28.7+ KB

If the dtype is object then it is in most cases a string. We can also look at the first or last data entries

df_cars.head()  # First 10
df_cars.head(5) # First 5
Car MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
0 Chevrolet Chevelle Malibu 18.0 8 307.0 130.0 3504.0 12.0 70 US
1 Buick Skylark 320 15.0 8 350.0 165.0 3693.0 11.5 70 US
2 Plymouth Satellite 18.0 8 318.0 150.0 3436.0 11.0 70 US
3 AMC Rebel SST 16.0 8 304.0 150.0 3433.0 12.0 70 US
4 Ford Torino 17.0 8 302.0 140.0 3449.0 10.5 70 US
df_cars.tail()  # Last 10
df_cars.tail(5) # Last 5
Car MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
401 Ford Mustang GL 27.0 4 140.0 86.0 2790.0 15.6 82 US
402 Volkswagen Pickup 44.0 4 97.0 52.0 2130.0 24.6 82 Europe
403 Dodge Rampage 32.0 4 135.0 84.0 2295.0 11.6 82 US
404 Ford Ranger 28.0 4 120.0 79.0 2625.0 18.6 82 US
405 Chevy S-10 31.0 4 119.0 82.0 2720.0 19.4 82 US

Accesing Data

df_cars.columns
Index(['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
'Acceleration', 'Model', 'Origin'],
dtype='object')
df_cars["Car"]
0      Chevrolet Chevelle Malibu
1 Buick Skylark 320
2 Plymouth Satellite
3 AMC Rebel SST
4 Ford Torino
...
401 Ford Mustang GL
402 Volkswagen Pickup
403 Dodge Rampage
404 Ford Ranger
405 Chevy S-10
Name: Car, Length: 406, dtype: object

This is also works but other is preferred as this can cause clashes with existing DataFrame attributes for example count

df_cars.Car
0      Chevrolet Chevelle Malibu
1 Buick Skylark 320
2 Plymouth Satellite
3 AMC Rebel SST
4 Ford Torino
...
401 Ford Mustang GL
402 Volkswagen Pickup
403 Dodge Rampage
404 Ford Ranger
405 Chevy S-10
Name: Car, Length: 406, dtype: object

You can also get Multiple columns.

df_cars[["Car", "MPG", "Cylinders"]]
Car MPG Cylinders
0 Chevrolet Chevelle Malibu 18.0 8
1 Buick Skylark 320 15.0 8
2 Plymouth Satellite 18.0 8
3 AMC Rebel SST 16.0 8
4 Ford Torino 17.0 8
... ... ... ...
401 Ford Mustang GL 27.0 4
402 Volkswagen Pickup 44.0 4
403 Dodge Rampage 32.0 4
404 Ford Ranger 28.0 4
405 Chevy S-10 31.0 4

406 rows × 3 columns

The iloc function stands for integer location and allows us to easly access data by using there indexes.

df_cars.iloc[0]  # row with index 0
Car             Chevrolet Chevelle Malibu
MPG 18.0
Cylinders 8
Displacement 307.0
Horsepower 130.0
Weight 3504.0
Acceleration 12.0
Model 70
Origin US
Name: 0, dtype: object

We can also multiple rows and specify certain columns but as we are using iloc it has to be the indexes of the columns.

df_cars.iloc[[0, 1], [0, 2, 3]]  # row 0 and 1 with 3 columns
Car Cylinders Displacement
0 Chevrolet Chevelle Malibu 8 307.0
1 Buick Skylark 320 8 350.0

We can also us slicing just like with python lists.

df_cars.iloc[0:5, 0:3]
Car MPG Cylinders
0 Chevrolet Chevelle Malibu 18.0 8
1 Buick Skylark 320 15.0 8
2 Plymouth Satellite 18.0 8
3 AMC Rebel SST 16.0 8
4 Ford Torino 17.0 8

To access data there is also the loc function which works the same way as iloc but uses the labels not the indexes.

df_cars.loc[[0, 1], ["Car", "Cylinders"]]
Car Cylinders
0 Chevrolet Chevelle Malibu 8
1 Buick Skylark 320 8
df_index.loc[["First", "Second"]]
name age designation
First a 20 VP
Second b 27 CEO

You can also use slices with loc important here is however that the end values is inclusive!

df_cars.loc[0:5, "Car":"Cylinders"]
Car MPG Cylinders
0 Chevrolet Chevelle Malibu 18.0 8
1 Buick Skylark 320 15.0 8
2 Plymouth Satellite 18.0 8
3 AMC Rebel SST 16.0 8
4 Ford Torino 17.0 8
5 Ford Galaxie 500 15.0 8
df_cars["Cylinders"].value_counts()
4    207
8 108
6 84
3 4
5 3
Name: Cylinders, dtype: int64

Indexes

As we have seen so far indexes are identifiers for rows. By default an index for a DataFrame or Series is just an integer sequence but as we have seen it can also be something different for example we can set a certain column as the index.

df_cars.set_index("Car")
MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
Car
Chevrolet Chevelle Malibu 18.0 8 307.0 130.0 3504.0 12.0 70 US
Buick Skylark 320 15.0 8 350.0 165.0 3693.0 11.5 70 US
Plymouth Satellite 18.0 8 318.0 150.0 3436.0 11.0 70 US
AMC Rebel SST 16.0 8 304.0 150.0 3433.0 12.0 70 US
Ford Torino 17.0 8 302.0 140.0 3449.0 10.5 70 US
... ... ... ... ... ... ... ... ...
Ford Mustang GL 27.0 4 140.0 86.0 2790.0 15.6 82 US
Volkswagen Pickup 44.0 4 97.0 52.0 2130.0 24.6 82 Europe
Dodge Rampage 32.0 4 135.0 84.0 2295.0 11.6 82 US
Ford Ranger 28.0 4 120.0 79.0 2625.0 18.6 82 US
Chevy S-10 31.0 4 119.0 82.0 2720.0 19.4 82 US

406 rows × 8 columns

We can however see that this change is not permenant. This is often the case in pandas when operating on Series or DataFrames. If we wanted these changes to be permenant we could call the method with the inplace parameter as True.

df_cars
Car MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
0 Chevrolet Chevelle Malibu 18.0 8 307.0 130.0 3504.0 12.0 70 US
1 Buick Skylark 320 15.0 8 350.0 165.0 3693.0 11.5 70 US
2 Plymouth Satellite 18.0 8 318.0 150.0 3436.0 11.0 70 US
3 AMC Rebel SST 16.0 8 304.0 150.0 3433.0 12.0 70 US
4 Ford Torino 17.0 8 302.0 140.0 3449.0 10.5 70 US
... ... ... ... ... ... ... ... ... ...
401 Ford Mustang GL 27.0 4 140.0 86.0 2790.0 15.6 82 US
402 Volkswagen Pickup 44.0 4 97.0 52.0 2130.0 24.6 82 Europe
403 Dodge Rampage 32.0 4 135.0 84.0 2295.0 11.6 82 US
404 Ford Ranger 28.0 4 120.0 79.0 2625.0 18.6 82 US
405 Chevy S-10 31.0 4 119.0 82.0 2720.0 19.4 82 US

406 rows × 9 columns

df_cars.set_index("Car", inplace=True)
df_cars
MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
Car
Chevrolet Chevelle Malibu 18.0 8 307.0 130.0 3504.0 12.0 70 US
Buick Skylark 320 15.0 8 350.0 165.0 3693.0 11.5 70 US
Plymouth Satellite 18.0 8 318.0 150.0 3436.0 11.0 70 US
AMC Rebel SST 16.0 8 304.0 150.0 3433.0 12.0 70 US
Ford Torino 17.0 8 302.0 140.0 3449.0 10.5 70 US
... ... ... ... ... ... ... ... ...
Ford Mustang GL 27.0 4 140.0 86.0 2790.0 15.6 82 US
Volkswagen Pickup 44.0 4 97.0 52.0 2130.0 24.6 82 Europe
Dodge Rampage 32.0 4 135.0 84.0 2295.0 11.6 82 US
Ford Ranger 28.0 4 120.0 79.0 2625.0 18.6 82 US
Chevy S-10 31.0 4 119.0 82.0 2720.0 19.4 82 US

406 rows × 8 columns

You can also sort the rows by using the index

df_cars.sort_index()
MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
Car
AMC Ambassador Brougham 13.0 8 360.0 175.0 3821.0 11.0 73 US
AMC Ambassador DPL 15.0 8 390.0 190.0 3850.0 8.5 70 US
AMC Ambassador SST 17.0 8 304.0 150.0 3672.0 11.5 72 US
AMC Concord 19.4 6 232.0 90.0 3210.0 17.2 78 US
AMC Concord 24.3 4 151.0 90.0 3003.0 20.1 80 US
... ... ... ... ... ... ... ... ...
Volvo 145e (sw) 18.0 4 121.0 112.0 2933.0 14.5 72 Europe
Volvo 244DL 22.0 4 121.0 98.0 2945.0 14.5 75 Europe
Volvo 245 20.0 4 130.0 102.0 3150.0 15.7 76 Europe
Volvo 264gl 17.0 6 163.0 125.0 3140.0 13.6 78 Europe
Volvo Diesel 30.7 6 145.0 76.0 3160.0 19.6 81 Europe

406 rows × 8 columns

df_cars.reset_index(inplace=True)
df_cars
Car MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
0 Chevrolet Chevelle Malibu 18.0 8 307.0 130.0 3504.0 12.0 70 US
1 Buick Skylark 320 15.0 8 350.0 165.0 3693.0 11.5 70 US
2 Plymouth Satellite 18.0 8 318.0 150.0 3436.0 11.0 70 US
3 AMC Rebel SST 16.0 8 304.0 150.0 3433.0 12.0 70 US
4 Ford Torino 17.0 8 302.0 140.0 3449.0 10.5 70 US
... ... ... ... ... ... ... ... ... ...
401 Ford Mustang GL 27.0 4 140.0 86.0 2790.0 15.6 82 US
402 Volkswagen Pickup 44.0 4 97.0 52.0 2130.0 24.6 82 Europe
403 Dodge Rampage 32.0 4 135.0 84.0 2295.0 11.6 82 US
404 Ford Ranger 28.0 4 120.0 79.0 2625.0 18.6 82 US
405 Chevy S-10 31.0 4 119.0 82.0 2720.0 19.4 82 US

406 rows × 9 columns